| Name | Student ID |
|---|---|
| I.M.S. Wimalasiri | 29785 |
| Y.K.A. Rathnasiri | 27413 |
| O.V.D.K.R. Weerasena | 27986 |
| S.S.T. Silva | 30017 |
Social media platforms have become deeply embedded in the daily lives of individuals, particularly among younger generations. While these platforms serve various positive functions such as communication, entertainment, and education, excessive and uncontrolled usage can lead to addictive behaviors, decreased productivity, and negative mental health outcomes. Despite growing awareness, many users underestimate the impact of their social media habits. This study investigates patterns of social media use, focusing on the degree of time-wasting and the prevalence of addiction among users.
The dataset used for this project is titled “Time-Wasters & Addiction on Social Media” and contains 1,000 objects collected from users across different countries. The dataset consists:
AgeGenderLocationProfessionPlatform – Primary social media platform used (e.g.,
TikTok, YouTube)Total Time Spent – Daily time spent on social media (in
minutes)Frequency – How often participants use social media
(e.g., hourly, daily)Number of Videos Watched – Count of videos consumed per
dayVideo Category – Type of content viewed (e.g., comedy,
education, news)DeviceType – Device used for accessing social media
(e.g., Mobile, Desktop)Watch Time – Time of day when social media is used
(e.g., Morning, Night)CurrentActivity – Activity being disrupted or
paralleled (e.g., studying, relaxing)Self Control – Self-reported control over usage (scale
1–10)Addiction Level – Measured addiction score (scale
0–7)Watch Reason – Main motivation for usage (e.g.,
boredom, habit, entertainment)The primary objective of this project is to analyze social media usage behavior and identify patterns linked to time-wasting and addiction. Using data visualization and statistical techniques in R, the analysis aims to:
# Loading all required libraries.
# Data manipulation and wrangling
library(tidyverse) # Includes ggplot2, dplyr, tidyr, etc.
## Warning: package 'tidyverse' was built under R version 4.3.3
## Warning: package 'tidyr' was built under R version 4.3.3
## Warning: package 'purrr' was built under R version 4.3.3
## Warning: package 'dplyr' was built under R version 4.3.3
## Warning: package 'stringr' was built under R version 4.3.3
## Warning: package 'forcats' was built under R version 4.3.3
## Warning: package 'lubridate' was built under R version 4.3.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr) # Data manipulation
library(tidyr) # Data reshaping
library(lubridate) # Provides intuitive functions that simplify parsing, extracting, manipulating, and analyzing date-time data.
# Visualization
library(ggplot2) # Core plotting system
library(gridExtra) # Arrange multiple plots in a grid
## Warning: package 'gridExtra' was built under R version 4.3.3
##
## Attaching package: 'gridExtra'
##
## The following object is masked from 'package:dplyr':
##
## combine
library(viridis) # Color palettes for colorblind-friendly plots
## Warning: package 'viridis' was built under R version 4.3.3
## Loading required package: viridisLite
library(RColorBrewer) # Additional qualitative and sequential color palettes
library(scales) # Axis formatting and scaling
##
## Attaching package: 'scales'
##
## The following object is masked from 'package:viridis':
##
## viridis_pal
##
## The following object is masked from 'package:purrr':
##
## discard
##
## The following object is masked from 'package:readr':
##
## col_factor
library(ggalluvial) # Alluvial/Sankey diagrams
## Warning: package 'ggalluvial' was built under R version 4.3.3
library(ggridges) # Ridge density plots
## Warning: package 'ggridges' was built under R version 4.3.3
library(gganimate) # Animated visualizations
## Warning: package 'gganimate' was built under R version 4.3.3
library(ggcorrplot) # Correlation matrix visualization
## Warning: package 'ggcorrplot' was built under R version 4.3.3
library(GGally) # Extended ggplot2 plots (e.g., ggpairs for pair plots)
## Warning: package 'GGally' was built under R version 4.3.3
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(hrbrthemes) # Clean and modern themes for ggplot2
## Warning: package 'hrbrthemes' was built under R version 4.3.3
library(ggiraph) # Interactive ggplot2 graphics
## Warning: package 'ggiraph' was built under R version 4.3.3
library(plotly) # Interactive plots
## Warning: package 'plotly' was built under R version 4.3.3
##
## Attaching package: 'plotly'
##
## The following object is masked from 'package:ggplot2':
##
## last_plot
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following object is masked from 'package:graphics':
##
## layout
library(patchwork) # Combine multiple ggplot2 plots into a single cohesive layout
## Warning: package 'patchwork' was built under R version 4.3.3
# Statistical analysis and diagnostics
library(corrplot) # Correlation plots (circle/heatmap style)
## Warning: package 'corrplot' was built under R version 4.3.3
## corrplot 0.95 loaded
library(car) # Companion to Applied Regression (e.g., VIF)
## Warning: package 'car' was built under R version 4.3.3
## Loading required package: carData
## Warning: package 'carData' was built under R version 4.3.3
##
## Attaching package: 'car'
##
## The following object is masked from 'package:dplyr':
##
## recode
##
## The following object is masked from 'package:purrr':
##
## some
library(lmtest) # Diagnostic tests for linear models
## Warning: package 'lmtest' was built under R version 4.3.3
## Loading required package: zoo
##
## Attaching package: 'zoo'
##
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
library(psych) # Descriptive statistics and psychometrics
## Warning: package 'psych' was built under R version 4.3.3
##
## Attaching package: 'psych'
##
## The following object is masked from 'package:car':
##
## logit
##
## The following objects are masked from 'package:scales':
##
## alpha, rescale
##
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
library(rstatix) # Easy statistical tests and results formatting
## Warning: package 'rstatix' was built under R version 4.3.3
##
## Attaching package: 'rstatix'
##
## The following object is masked from 'package:ggcorrplot':
##
## cor_pmat
##
## The following object is masked from 'package:stats':
##
## filter
library(effectsize) # Effect size calculations
## Warning: package 'effectsize' was built under R version 4.3.3
##
## Attaching package: 'effectsize'
##
## The following objects are masked from 'package:rstatix':
##
## cohens_d, eta_squared
##
## The following object is masked from 'package:psych':
##
## phi
library(emmeans) # Estimated marginal means and post-hoc comparisons
## Welcome to emmeans.
## Caution: You lose important information if you filter this package's results.
## See '? untidy'
##
## Attaching package: 'emmeans'
##
## The following object is masked from 'package:GGally':
##
## pigs
library(moments) # Skewness, kurtosis, and moment statistics
library(sandwich) # Compute robust covariance matrix estimators for linear regression models
## Warning: package 'sandwich' was built under R version 4.3.3
# Missing data
library(naniar) # Visualization of missing values
## Warning: package 'naniar' was built under R version 4.3.3
# Maps and Geo-Spacial Analysis
library(maps) # Map outlines
## Warning: package 'maps' was built under R version 4.3.3
##
## Attaching package: 'maps'
##
## The following object is masked from 'package:viridis':
##
## unemp
##
## The following object is masked from 'package:purrr':
##
## map
library(mapdata) # Additional map details
## Warning: package 'mapdata' was built under R version 4.3.3
library(rworldmap) # World mapping tools
## Warning: package 'rworldmap' was built under R version 4.3.3
## Loading required package: sp
## Warning: package 'sp' was built under R version 4.3.3
## ### Welcome to rworldmap ###
## For a short introduction type : vignette('rworldmap')
library(treemap) # Treemap visualizations
## Warning: package 'treemap' was built under R version 4.3.3
library(networkD3) # D3.js-based interactive network diagrams
# Country code conversion
library(countrycode) # Convert country names to ISO codes
## Warning: package 'countrycode' was built under R version 4.3.3
# Radar/fmsb charts
library(fmsb) # Radar charts for multivariate data
## Warning: package 'fmsb' was built under R version 4.3.3
##
## Attaching package: 'fmsb'
##
## The following objects are masked from 'package:effectsize':
##
## oddsratio, riskratio
df1 <- read_csv("D:/OneDrive - NSBM/NSBM 22.2/Year 3 Semester 1/DS304.3-Data Visualization/Group Project/Ect/Time-Wasters & Addiction on Social Media.csv")
## Rows: 1000 Columns: 16
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (9): Gender, Location, Profession, Platform, Video Category, Frequency,...
## dbl (6): Age, Total Time Spent, Video Length, Number of Videos Watched, Sel...
## time (1): Watch Time
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
df2 <- read_csv("D:/OneDrive - NSBM/NSBM 22.2/Year 3 Semester 1/DS304.3-Data Visualization/Group Project/Ect/Untitledhttps___nsbm365-my.sharepoint.com__x__g_personal_desilvadsd_students_nsbm_ac_lk_ERjiMVt_JrlFkw.csv")
## Rows: 30 Columns: 16
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (11): Gender, Location ( e.g., Sri Lankan), Profession (student, Artis...
## dbl (4): Age, Number of videos you watch per day, Rate your self-control w...
## time (1): What time do you usually watch social media videos?
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(df1)
## # A tibble: 6 × 16
## Age Gender Location Profession Platform `Total Time Spent` `Video Category`
## <dbl> <chr> <chr> <chr> <chr> <dbl> <chr>
## 1 56 Male Pakistan Engineer Instagr… 80 Pranks
## 2 46 Female Mexico Artist Instagr… 228 Pranks
## 3 32 Female United S… Engineer Facebook 30 Vlogs
## 4 60 Male Barzil Waiting s… YouTube 101 Vlogs
## 5 25 Male Pakistan Manager TikTok 136 Gaming
## 6 38 Male Vietnam driver Instagr… 89 Jokes/Memes
## # ℹ 9 more variables: `Video Length` <dbl>, `Number of Videos Watched` <dbl>,
## # Frequency <chr>, `Watch Reason` <chr>, DeviceType <chr>,
## # `Watch Time` <time>, `Self Control` <dbl>, `Addiction Level` <dbl>,
## # CurrentActivity <chr>
head(df2)
## # A tibble: 6 × 16
## Age Gender `Location ( e.g., Sri Lankan)` `Profession (student, Artist)`
## <dbl> <chr> <chr> <chr>
## 1 23 Male Sri Lankan Student
## 2 27 Male Sri Lankan Student
## 3 23 Male Sri Lankan Student
## 4 23 Male Sri Lankan Student
## 5 24 Female Sri Lankan Studnet
## 6 23 Male Sri Lankan student
## # ℹ 12 more variables: `Which platform do you mostly use?` <chr>,
## # `Total time spent on social media per day (in minutes)` <chr>,
## # `What type of video content do you watch most?` <chr>,
## # `Average video length you watch (in minutes)` <chr>,
## # `Number of videos you watch per day` <dbl>,
## # `What time of day do you mostly use social media?` <chr>,
## # `Why do you usually watch videos on social media?` <chr>, …
# Finding a common column to merge on
print(colnames(df1))
## [1] "Age" "Gender"
## [3] "Location" "Profession"
## [5] "Platform" "Total Time Spent"
## [7] "Video Category" "Video Length"
## [9] "Number of Videos Watched" "Frequency"
## [11] "Watch Reason" "DeviceType"
## [13] "Watch Time" "Self Control"
## [15] "Addiction Level" "CurrentActivity"
# Finding a common column to merge on
print(colnames(df2))
## [1] "Age"
## [2] "Gender"
## [3] "Location ( e.g., Sri Lankan)"
## [4] "Profession (student, Artist)"
## [5] "Which platform do you mostly use?"
## [6] "Total time spent on social media per day (in minutes)"
## [7] "What type of video content do you watch most?"
## [8] "Average video length you watch (in minutes)"
## [9] "Number of videos you watch per day"
## [10] "What time of day do you mostly use social media?"
## [11] "Why do you usually watch videos on social media?"
## [12] "Device you use most for watching videos"
## [13] "What time do you usually watch social media videos?"
## [14] "Rate your self-control when it comes to using social media"
## [15] "Rate your level of addiction to social media"
## [16] "What are you usually doing while using social media?"
# Rename df2 columns to match df1
colnames(df2) <- c(
"Age",
"Gender",
"Location",
"Profession",
"Platform",
"Total Time Spent",
"Video Category",
"Video Length",
"Number of Videos Watched",
"Frequency",
"Watch Reason",
"DeviceType",
"Watch Time",
"Self Control",
"Addiction Level",
"CurrentActivity"
)
colnames(df2)
## [1] "Age" "Gender"
## [3] "Location" "Profession"
## [5] "Platform" "Total Time Spent"
## [7] "Video Category" "Video Length"
## [9] "Number of Videos Watched" "Frequency"
## [11] "Watch Reason" "DeviceType"
## [13] "Watch Time" "Self Control"
## [15] "Addiction Level" "CurrentActivity"
# Converting df2 to numeric
df2$`Total Time Spent` <- as.numeric(gsub("[^0-9.]", "", df2$`Total Time Spent`))
df2$`Video Length` <- as.numeric(gsub("[^0-9.]", "", df2$`Video Length`))
df2$`Number of Videos Watched` <- as.numeric(gsub("[^0-9.]", "", df2$`Number of Videos Watched`))
# Merging after fixing types
merged_df <- full_join(df1, df2, by = c(
"Age", "Gender", "Location", "Profession", "Platform", "Total Time Spent",
"Video Category", "Video Length", "Number of Videos Watched", "Frequency",
"Watch Reason", "DeviceType", "Watch Time", "Self Control",
"Addiction Level", "CurrentActivity"
))
write_csv(merged_df, "merged_dataset.csv")
data <- read.csv("merged_dataset.csv", stringsAsFactors = FALSE)
str(data)
## 'data.frame': 1030 obs. of 16 variables:
## $ Age : int 56 46 32 60 25 38 56 36 40 28 ...
## $ Gender : chr "Male" "Female" "Female" "Male" ...
## $ Location : chr "Pakistan" "Mexico" "United States" "Barzil" ...
## $ Profession : chr "Engineer" "Artist" "Engineer" "Waiting staff" ...
## $ Platform : chr "Instagram" "Instagram" "Facebook" "YouTube" ...
## $ Total.Time.Spent : int 80 228 30 101 136 89 247 191 34 165 ...
## $ Video.Category : chr "Pranks" "Pranks" "Vlogs" "Vlogs" ...
## $ Video.Length : int 24 19 19 4 19 9 16 27 4 6 ...
## $ Number.of.Videos.Watched: int 22 31 7 41 21 16 20 43 44 47 ...
## $ Frequency : chr "Night" "Afternoon" "Evening" "Night" ...
## $ Watch.Reason : chr "Procrastination" "Habit" "Entertainment" "Habit" ...
## $ DeviceType : chr "Smartphone" "Computer" "Tablet" "Smartphone" ...
## $ Watch.Time : chr "21:00:00" "17:00:00" "14:00:00" "21:00:00" ...
## $ Self.Control : int 5 7 8 5 10 5 10 5 5 8 ...
## $ Addiction.Level : int 5 3 2 5 0 5 0 5 5 2 ...
## $ CurrentActivity : chr "Commuting" "At school" "At home" "Commuting" ...
summary(data)
## Age Gender Location Profession
## Min. :18.00 Length:1030 Length:1030 Length:1030
## 1st Qu.:28.00 Class :character Class :character Class :character
## Median :41.00 Mode :character Mode :character Mode :character
## Mean :40.48
## 3rd Qu.:52.00
## Max. :64.00
## Platform Total.Time.Spent Video.Category Video.Length
## Length:1030 Min. : 10.0 Length:1030 Min. : 1.00
## Class :character 1st Qu.: 76.0 Class :character 1st Qu.: 8.00
## Mode :character Median :146.0 Mode :character Median : 15.00
## Mean :148.8 Mean : 27.33
## 3rd Qu.:221.0 3rd Qu.: 22.00
## Max. :298.0 Max. :3040.00
## Number.of.Videos.Watched Frequency Watch.Reason
## Min. : 1.00 Length:1030 Length:1030
## 1st Qu.:14.00 Class :character Class :character
## Median :25.00 Mode :character Mode :character
## Mean :25.13
## 3rd Qu.:37.00
## Max. :80.00
## DeviceType Watch.Time Self.Control Addiction.Level
## Length:1030 Length:1030 Min. : 3.000 Min. :0.000
## Class :character Class :character 1st Qu.: 5.000 1st Qu.:2.000
## Mode :character Mode :character Median : 7.000 Median :3.000
## Mean : 7.071 Mean :2.983
## 3rd Qu.: 8.000 3rd Qu.:5.000
## Max. :10.000 Max. :9.000
## CurrentActivity
## Length:1030
## Class :character
## Mode :character
##
##
##
missing_values <- sapply(data, function(x) sum(is.na(x) | x == ""))
print(missing_values)
## Age Gender Location
## 0 0 0
## Profession Platform Total.Time.Spent
## 0 0 0
## Video.Category Video.Length Number.of.Videos.Watched
## 0 0 0
## Frequency Watch.Reason DeviceType
## 0 0 0
## Watch.Time Self.Control Addiction.Level
## 0 0 0
## CurrentActivity
## 0
gg_miss_var(data) +
labs(title = "Missing Values by Variable",
y = "Variables",
x = "Count of Missing Values") +
theme_minimal()
# replacing median values with NA vales of numeric columns,
numeric_cols <- c("Age", "Total.Time.Spent", "Video.Length",
"Number.of.Videos.Watched", "Self.Control", "Addiction.Level")
for (col in numeric_cols) {
if (missing_values[col] > 0) {
data[is.na(data[,col]) | data[,col] == "", col] <- median(data[,col], na.rm = TRUE)
}
}
# replacing mode values with NA vales of categorical columns,
categorical_cols <- c("Gender", "Location", "Profession", "Platform",
"Video.Category", "Frequency", "Watch.Reason",
"DeviceType", "Watch.Time", "CurrentActivity")
get_mode <- function(v) {
uniqv <- unique(na.omit(v))
uniqv[which.max(tabulate(match(v, uniqv)))]
}
for (col in categorical_cols) {
if (missing_values[col] > 0) {
mode_value <- get_mode(data[,col])
data[is.na(data[,col]) | data[,col] == "", col] <- mode_value
}
}
# Converting categorical variables to factors for better analysis
data <- data %>%
mutate(
Gender = as.factor(Gender),
Location = as.factor(Location),
Profession = as.factor(Profession),
Platform = as.factor(Platform),
Video.Category = as.factor(Video.Category),
Frequency = as.factor(Frequency),
Watch.Reason = as.factor(Watch.Reason),
DeviceType = as.factor(DeviceType),
Watch.Time = as.factor(Watch.Time),
CurrentActivity = as.factor(CurrentActivity)
)
# Extracting hour from Watch.Time for better analysis
# Converting Watch.Time to proper time format
data$Watch.Time <- as.character(data$Watch.Time)
data$Hour <- as.numeric(substr(data$Watch.Time, 1, 2))
# Checking for outliers in numeric variables
boxplot_list <- list()
for (col in numeric_cols) {
p <- ggplot(data, aes_string(y = col)) +
geom_boxplot(fill = "#69b3a2") +
theme_minimal() +
labs(title = paste("Boxplot of", col))
boxplot_list[[col]] <- p
}
## Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
## ℹ Please use tidy evaluation idioms with `aes()`.
## ℹ See also `vignette("ggplot2-in-packages")` for more information.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
do.call(grid.arrange, c(boxplot_list, ncol = 2))
platform_summary <- data %>%
group_by(Platform) %>%
summarise(
Count = n(),
Mean_Age = mean(Age, na.rm = TRUE),
Mean_Time_Spent = mean(Total.Time.Spent, na.rm = TRUE),
Mean_Videos_Watched = mean(Number.of.Videos.Watched, na.rm = TRUE),
Mean_Addiction_Level = mean(Addiction.Level, na.rm = TRUE),
Mean_Self_Control = mean(Self.Control, na.rm = TRUE)
)
print(platform_summary)
## # A tibble: 4 × 7
## Platform Count Mean_Age Mean_Time_Spent Mean_Videos_Watched
## <fct> <int> <dbl> <dbl> <dbl>
## 1 Facebook 221 40.3 155. 24.1
## 2 Instagram 266 40.8 144. 25.8
## 3 TikTok 277 41.1 150. 24.7
## 4 YouTube 266 39.7 147. 25.7
## # ℹ 2 more variables: Mean_Addiction_Level <dbl>, Mean_Self_Control <dbl>
# Age distribution
age_plot <- ggplot(data, aes(x = Age)) +
geom_histogram(binwidth = 5, fill = "#69b3a2", color = "white", alpha = 0.7) +
stat_bin(binwidth = 5, geom = "text", aes(label = ..count..), vjust = 1.5) +
labs(title = "Age Distribution",
x = "Age",
y = "Count") +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5, face = "bold"))
# Platform distribution
platform_plot <- ggplot(data, aes(x = Platform, fill = Platform)) +
geom_bar() +
geom_text(stat = "count", aes(label = ..count..), vjust = 1.5, color = "white") +
scale_fill_viridis_d() +
labs(title = "Distribution by Platform",
x = "Platform",
y = "Count") +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5, face = "bold"),
legend.position = "none")
# Addiction level distribution
addiction_plot <- ggplot(data, aes(x = Addiction.Level)) +
geom_histogram(binwidth = 1, fill = "#ff7f0e", color = "white", alpha = 0.7) +
stat_bin(binwidth = 1, geom = "text", aes(label = ..count..), vjust = 1.5, color = "black") +
labs(title = "Addiction Level Distribution",
x = "Addiction Level",
y = "Count") +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5, face = "bold"))
# Time spent distribution
time_plot <- ggplot(data, aes(x = Total.Time.Spent)) +
geom_histogram(binwidth = 10, fill = "#2ca02c", color = "white", alpha = 0.7) +
labs(title = "Total Time Spent Distribution",
x = "Total Time Spent (minutes)",
y = "Count") +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5, face = "bold"))
basic_dist_plots <- grid.arrange(age_plot, platform_plot, addiction_plot, time_plot, ncol = 2)
## Warning: The dot-dot notation (`..count..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(count)` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
ggsave("basic_distributions.png", basic_dist_plots, width = 12, height = 10)
# mean values per Platform
mean_stats <- data %>%
group_by(Platform) %>%
summarise(
mean_addiction = mean(Addiction.Level, na.rm = TRUE),
mean_time = mean(Total.Time.Spent, na.rm = TRUE),
mean_videos = mean(Number.of.Videos.Watched, na.rm = TRUE)
)
# Addiction level by platform
platform_addiction <- ggplot(data, aes(x = Platform, y = Addiction.Level, fill = Platform)) +
geom_boxplot(alpha = 0.7) +
geom_text(data = mean_stats, aes(x = Platform, y = mean_addiction, label = round(mean_addiction, 1)),
vjust = -0.5, color = "black", fontface = "bold") +
scale_fill_viridis_d() +
labs(title = "Addiction Level by Platform", x = "Platform", y = "Addiction Level") +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5, face = "bold"), legend.position = "none")
# Time spent by platform
platform_time <- ggplot(data, aes(x = Platform, y = Total.Time.Spent, fill = Platform)) +
geom_boxplot(alpha = 0.7) +
geom_text(data = mean_stats, aes(x = Platform, y = mean_time, label = round(mean_time, 1)),
vjust = -0.5, color = "black", fontface = "bold") +
scale_fill_viridis_d() +
labs(title = "Total Time Spent by Platform", x = "Platform", y = "Total Time Spent (minutes)") +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5, face = "bold"), legend.position = "none")
# Videos watched by platform
platform_videos <- ggplot(data, aes(x = Platform, y = Number.of.Videos.Watched, fill = Platform)) +
geom_boxplot(alpha = 0.7) +
geom_text(data = mean_stats, aes(x = Platform, y = mean_videos, label = round(mean_videos, 1)),
vjust = -0.5, color = "black", fontface = "bold") +
scale_fill_viridis_d() +
labs(title = "Number of Videos Watched by Platform", x = "Platform", y = "Number of Videos Watched") +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5, face = "bold"), legend.position = "none")
platform_comp_plots <- grid.arrange(platform_addiction, platform_time, platform_videos, ncol = 2)
These boxplots compare Addiction Level, Total Time Spent, and Number of Videos Watched across four major platforms: Facebook, Instagram, TikTok, and YouTube.
ggsave("platform_comparisons.png", platform_comp_plots, width = 15, height = 5)
# Watch reason by platform
reason_platform <- ggplot(data, aes(x = Watch.Reason, fill = Platform)) +
geom_bar(position = "dodge") +
scale_fill_viridis_d() +
labs(title = "Watch Reason by Platform",
x = "Watch Reason",
y = "Count") +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5, face = "bold"),
axis.text.x = element_text(angle = 45, hjust = 1))
# Hour of day analysis
hour_platform <- ggplot(data, aes(x = Total.Time.Spent, fill = Platform)) +
geom_histogram(position = "dodge", binwidth = 1, alpha = 0.7) +
scale_fill_viridis_d() +
labs(title = "Watch Time by Platform",
x = "Hour of Day",
y = "Count") +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5, face = "bold"))
# Video category by platform
category_platform <- ggplot(data, aes(x = Video.Category, fill = Platform)) +
geom_bar(position = "dodge") +
scale_fill_viridis_d() +
labs(title = "Video Category by Platform",
x = "Video Category",
y = "Count") +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5, face = "bold"),
axis.text.x = element_text(angle = 45, hjust = 1))
# Gender distribution by platform
gender_platform <- ggplot(data, aes(x = Gender, fill = Platform)) +
geom_bar(position = "dodge") +
scale_fill_viridis_d() +
labs(title = "Gender Distribution by Platform",
x = "Gender",
y = "Count") +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5, face = "bold"))
categorical_plots1 <- grid.arrange(reason_platform, category_platform, ncol = 1)
categorical_plots2 <- grid.arrange(hour_platform, gender_platform, ncol = 1)
This section analyzes how different platforms are used based on user motivations, content preferences, watching time behavior, and gender distribution.
Insight: Social media is most often used out of habit and entertainment needs, implying habitual engagement rather than intentional use.
Insight: Users gravitate toward light-hearted or practical content, and content preferences vary by platform—e.g., TikTok users favor memes, while YouTube sees more gaming and educational content.
Insight: Users access platforms consistently throughout the day, though high-frequency use in shorter bursts is especially visible for short-form video apps like TikTok.
Insight: The dataset indicates a gender imbalance, which may affect addiction patterns, content preference, and platform usage style.
| Aspect | Notable Findings |
|---|---|
| Watch Reason | Habit, boredom, and entertainment are top motivations |
| Content Preference | Jokes/Memes and Life Hacks dominate, esp. on TikTok |
| Watch Time Behavior | Peaks in short bursts; consistent usage across the day |
| Gender Composition | Higher male usage on all platforms |
video_category_counts <- data %>%
count(Video.Category) %>%
arrange(desc(n))
interactive_categories <- plot_ly(
video_category_counts,
x = ~Video.Category,
y = ~n,
type = "bar",
marker = list(
color = viridis(length(unique(data$Video.Category))),
line = list(color = "rgb(8,48,107)", width = 1.5)
),
text = ~n,
textposition = "auto"
) %>%
layout(
title = "Video Category Distribution",
xaxis = list(title = "Video Category", tickangle = 45),
yaxis = list(title = "Count"),
annotations = list(
x = 1, y = -0.15,
text = " ",
showarrow = F,
xref = 'paper', yref = 'paper'
)
)
interactive_categories
This bar chart displays the frequency of videos watched across various content categories:
print(levels(data$Gender))
## [1] "Female" "Male" "Other"
print(table(data$Gender))
##
## Female Male Other
## 326 540 164
# removing "other" from gender
data$Gender_fixed <- as.factor(ifelse(data$Gender %in% c("Male", "male"), "Male",
ifelse(data$Gender %in% c("Female", "female"), "Female", NA)))
data <- data[!is.na(data$Gender_fixed), ]
print("Fixed Gender variable levels:")
## [1] "Fixed Gender variable levels:"
print(levels(data$Gender_fixed))
## [1] "Female" "Male"
print(table(data$Gender_fixed))
##
## Female Male
## 326 540
platform_gender <- data %>%
count(Platform, Gender_fixed) %>%
group_by(Platform)
interactive_platform_gender <- plot_ly(
platform_gender,
x = ~Platform,
y = ~n,
color = ~Gender_fixed,
type = "bar",
text = ~n,
textposition = "auto",
colors = viridis(length(unique(data$Gender_fixed)))
) %>%
layout(
title = "Platform Usage by Gender",
xaxis = list(title = "Platform"),
yaxis = list(title = "Count"),
barmode = "stack"
)
interactive_platform_gender
This chart displays social media platform preferences segmented by gender:
This section presents multi-dimensional comparisons across social media platforms using advanced statistical visualizations. Each plot highlights behavioral and psychological trends among users.
# Preparing data for parallel coordinates
parallel_data <- data %>%
select(Platform, Age, Total.Time.Spent, Number.of.Videos.Watched,
Self.Control, Addiction.Level) %>%
group_by(Platform) %>%
summarise(
Mean_Age = mean(Age, na.rm = TRUE),
Mean_Time_Spent = mean(Total.Time.Spent, na.rm = TRUE),
Mean_Videos = mean(Number.of.Videos.Watched, na.rm = TRUE),
Mean_Self_Control = mean(Self.Control, na.rm = TRUE),
Mean_Addiction = mean(Addiction.Level, na.rm = TRUE)
) %>%
ungroup()
# Reshaping for parallel coordinates
parallel_data_long <- parallel_data %>%
pivot_longer(cols = -Platform,
names_to = "Variable",
values_to = "Value")
# Scaling values to 0-1 for better comparison
parallel_data_long <- parallel_data_long %>%
group_by(Variable) %>%
mutate(Scaled_Value = (Value - min(Value)) / (max(Value) - min(Value))) %>%
ungroup()
parallel_plot <- ggplot(parallel_data_long,
aes(x = Variable, y = Scaled_Value, group = Platform, color = Platform)) +
geom_line(size = 1.2) +
geom_point(size = 3) +
scale_color_viridis_d() +
labs(title = "Parallel Coordinate Plot: Key Metrics by Platform",
subtitle = "Values scaled to 0-1 range for comparison",
x = "Variable",
y = "Scaled Value") +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5, face = "bold"),
plot.subtitle = element_text(hjust = 0.5),
axis.text.x = element_text(angle = 45, hjust = 1))
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
print(parallel_plot)
Insight: Platforms vary significantly in usage styles. Instagram is fast-paced and addictive, YouTube is intentional and controlled, Facebook is high-time but passive.
# subset of the data
data_subset <- data %>%
sample_n(min(500, nrow(data))) %>%
select(Platform, Age, Total.Time.Spent, Number.of.Videos.Watched,
Self.Control, Addiction.Level)
scatter_matrix <- ggpairs(
data_subset,
columns = 2:6,
mapping = aes(color = Platform),
upper = list(continuous = wrap("cor", size = 3)),
diag = list(continuous = wrap("densityDiag")),
lower = list(continuous = wrap("points", alpha = 0.5, size = 0.8)),
title = "Scatter Plot Matrix by Platform"
) +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
scale_color_viridis_d() +
scale_fill_viridis_d()
print(scatter_matrix)
Self-Control vs Addiction Level:
Strong negative (e.g., -0.96 for Instagram)Videos Watched vs Addiction Level:
Moderate positive for TikTokInsight: Psychological and behavioral patterns vary across platforms, with Instagram and TikTok showing tighter coupling between usage and addiction.
ridge_addiction <- ggplot(data, aes(x = Addiction.Level, y = Platform, fill = Platform)) +
geom_density_ridges(alpha = 0.7, scale = 2, quantile_lines = TRUE, quantiles = 2) +
scale_fill_viridis_d() +
labs(title = "Distribution of Addiction Levels by Platform",
subtitle = "With median lines",
x = "Addiction Level",
y = "Platform") +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.5, face = "bold"),
plot.subtitle = element_text(hjust = 0.5),
legend.position = "none"
)
print(ridge_addiction)
## Picking joint bandwidth of 0.654
Insight: Algorithm-heavy platforms lead to higher reported addiction levels.
prop_data <- data %>%
count(Platform, Video.Category) %>%
group_by(Platform) %>%
mutate(Proportion = n / sum(n)) %>%
ungroup()
prop_bars <- ggplot(prop_data, aes(x = Platform, y = Proportion, fill = Video.Category)) +
geom_bar(stat = "identity", position = "fill") +
scale_fill_viridis_d() +
labs(title = "Proportional Distribution of Video Categories by Platform",
x = "Platform",
y = "Proportion") +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.5, face = "bold"),
axis.text.x = element_text(angle = 0)
)
print(prop_bars)
Jokes/Memes, Life Hacks,
TrendsGaming,
VlogsInsight: Platform algorithms reinforce specific content types.
alluvial_data <- data %>%
group_by(Platform, Video.Category, Watch.Reason) %>%
summarise(Count = n(), .groups = 'drop')
alluvial_plot <- ggplot(alluvial_data,
aes(axis1 = Platform, axis2 = Video.Category, axis3 = Watch.Reason,
y = Count)) +
geom_alluvium(aes(fill = Platform), width = 1/12) +
geom_stratum(width = 1/12, fill = "white", color = "grey") +
geom_label(stat = "stratum", aes(label = after_stat(stratum))) +
scale_fill_viridis_d() +
labs(title = "Flow Diagram: Platform → Video Category → Watch Reason",
x = "",
y = "Count") +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.5, face = "bold"),
legend.position = "none"
)
print(alluvial_plot)
## Warning in to_lodes_form(data = data, axes = axis_ind, discern =
## params$discern): Some strata appear at multiple axes.
## Warning in to_lodes_form(data = data, axes = axis_ind, discern =
## params$discern): Some strata appear at multiple axes.
## Warning in to_lodes_form(data = data, axes = axis_ind, discern =
## params$discern): Some strata appear at multiple axes.
Jokes/Memes → HabitGaming → EntertainmentLife Hacks → BoredomInsight: Most users are driven by habit and entertainment, with platform-personalized content driving reasons.
hour_platform_heatmap <- data %>%
count(Total.Time.Spent, Platform) %>%
ggplot(aes(x = Total.Time.Spent, y = Platform, fill = n)) +
geom_tile() +
scale_fill_viridis_c(name = "Count") +
scale_x_continuous(breaks = 0:23) +
labs(title = "Heatmap: Watch Hours by Platform",
x = "Hour of Day",
y = "Platform") +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.5, face = "bold")
)
print(hour_platform_heatmap)
Insight: Usage sessions on TikTok and YouTube may be shorter but more frequent or intensive.
bubble_chart <- ggplot(data, aes(x = Total.Time.Spent, y = Number.of.Videos.Watched,
size = Addiction.Level, color = Platform)) +
geom_point(alpha = 0.7) +
scale_size_continuous(range = c(1, 10)) +
scale_color_viridis_d() +
labs(title = "Relationship: Time Spent, Videos Watched & Addiction Level",
x = "Total Time Spent (minutes)",
y = "Number of Videos Watched",
size = "Addiction Level") +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.5, face = "bold")
)
print(bubble_chart)
Insight: There’s a clear volume-addiction connection — the more time and content, the higher the addiction.
violin_plot <- ggplot(data, aes(x = Platform, y = Addiction.Level, fill = Gender)) +
geom_violin(position = position_dodge(0.7), alpha = 0.7) +
geom_boxplot(position = position_dodge(0.7), width = 0.1, alpha = 0.7) +
scale_fill_viridis_d() +
labs(title = "Addiction Level by Platform and Gender",
x = "Platform",
y = "Addiction Level") +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.5, face = "bold")
)
print(violin_plot)
Insight: Addiction distribution varies more in marginalized or less represented gender categories.
# Function to calculate correlation for each platform
get_platform_cor <- function(platform_name) {
platform_data <- data %>%
filter(Platform == platform_name) %>%
select(Age, Total.Time.Spent, Video.Length, Number.of.Videos.Watched,
Self.Control, Addiction.Level)
cor_matrix <- cor(platform_data, use = "pairwise.complete.obs")
cor_df <- as.data.frame(as.table(cor_matrix))
cor_df$Platform <- platform_name
return(cor_df)
}
# Applying function to each platform
platforms <- unique(data$Platform)
cor_data <- do.call(rbind, lapply(platforms, get_platform_cor))
names(cor_data) <- c("Var1", "Var2", "Correlation", "Platform")
cor_heatmap <- ggplot(cor_data, aes(x = Var1, y = Var2, fill = Correlation)) +
geom_tile() +
facet_wrap(~ Platform) +
scale_fill_gradient2(low = "#1b9e77", mid = "white", high = "#d95f02",
midpoint = 0, limits = c(-1, 1)) +
labs(title = "Correlation Heatmaps by Platform",
x = "",
y = "") +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.5, face = "bold"),
axis.text.x = element_text(angle = 45, hjust = 1)
)
print(cor_heatmap)
| Platform | Positive Correlations | Negative Correlation |
|---|---|---|
| Time ↔︎ Addiction | Self-Control ↔︎ Addiction | |
| Videos ↔︎ Addiction | Self-Control ↔︎ Addiction | |
| TikTok | Time & Videos ↔︎ Addiction | Self-Control ↔︎ Addiction |
| YouTube | Videos ↔︎ Addiction | Self-Control ↔︎ Addiction |
Insight: Across all platforms, Self-Control is consistently and negatively correlated with Addiction Level, supporting the psychological consistency of the data.
This section analyzes distribution of Total Time Spent (in minutes) across different platforms:
ridge_plot <- ggplot(data, aes(x = Total.Time.Spent, y = Platform, fill = Platform)) +
geom_density_ridges(alpha = 0.7, scale = 2) +
scale_fill_viridis_d() +
labs(title = "Distribution of Time Spent by Platform",
x = "Total Time Spent (minutes)",
y = "Platform") +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.5, face = "bold"),
legend.position = "none"
)
print(ridge_plot)
## Picking joint bandwidth of 26.1
treemap_data <- data %>%
group_by(Video.Category, Platform) %>%
summarise(
Total_Videos = sum(Number.of.Videos.Watched, na.rm = TRUE),
.groups = 'drop'
)
treemap(
treemap_data,
index = c("Video.Category", "Platform"),
vSize = "Total_Videos",
type = "index",
palette = viridis(length(unique(treemap_data$Video.Category))),
title = "Distribution of Videos Watched by Category and Platform",
fontsize.labels = c(14, 10),
fontcolor.labels = c("white", "black"),
border.col = c("white", "gray70"),
inflate.labels = TRUE
)
The treemap above illustrates the distribution of video content types across social media platforms:
# data for Sankey diagram
network_data <- data %>%
group_by(Watch.Reason, Platform) %>%
summarise(value = n(), .groups = 'drop')
# nodes and links for the network diagram
all_nodes <- unique(c(as.character(network_data$Watch.Reason),
as.character(network_data$Platform)))
nodes_df <- data.frame(
name = all_nodes,
group = c(rep("Reason", length(unique(network_data$Watch.Reason))),
rep("Platform", length(unique(network_data$Platform))))
)
links_df <- data.frame(
source = match(network_data$Watch.Reason, all_nodes) - 1,
target = match(network_data$Platform, all_nodes) - 1,
value = network_data$value
)
# Sankey diagram
sankeyNetwork(
Links = links_df,
Nodes = nodes_df,
Source = "source",
Target = "target",
Value = "value",
NodeID = "name",
NodeGroup = "group",
colourScale = JS("d3.scaleOrdinal(d3.schemeCategory20)"),
sinksRight = TRUE,
nodeWidth = 30,
nodePadding = 10,
height = 500,
width = 800,
fontSize = 12
)
This Sankey diagram illustrates how different user motivations for social media usage map to specific platform preferences:
bubble_data <- data %>%
group_by(Platform, Video.Category) %>%
summarise(
Avg_Addiction = mean(Addiction.Level, na.rm = TRUE),
Total_Videos = sum(Number.of.Videos.Watched, na.rm = TRUE),
Avg_Time = mean(Total.Time.Spent, na.rm = TRUE),
Count = n(),
.groups = 'drop'
)
bubble_plot <- plot_ly(
data = bubble_data,
x = ~Avg_Time,
y = ~Avg_Addiction,
size = ~Total_Videos,
color = ~Platform,
text = ~paste0(
"<b>Platform:</b> ", Platform,
"<br><b>Category:</b> ", Video.Category,
"<br><b>Avg Time:</b> ", round(Avg_Time, 1), " min",
"<br><b>Avg Addiction:</b> ", round(Avg_Addiction, 2),
"<br><b>Total Videos:</b> ", Total_Videos,
"<br><b>Users:</b> ", Count
),
type = 'scatter',
mode = 'markers',
marker = list(opacity = 0.8, sizemode = 'diameter', line = list(width = 1, color = '#333')),
sizes = c(10, 80)
) %>%
layout(
title = list(text = "<b>Relationship Between Time Spent, Addiction Level, and Videos Watched</b>"),
xaxis = list(title = "<b>Average Time Spent (minutes)</b>", zeroline = FALSE),
yaxis = list(title = "<b>Average Addiction Level</b>", zeroline = FALSE, range = c(1.5, 5.2)),
legend = list(title = list(text = "<b>Platform</b>"), orientation = "v", x = 1.02),
hoverlabel = list(bgcolor = "white", font = list(size = 12)),
margin = list(r = 120, t = 80)
)
bubble_plot
## Warning: `line.width` does not currently support multiple values.
## Warning: `line.width` does not currently support multiple values.
## Warning: `line.width` does not currently support multiple values.
## Warning: `line.width` does not currently support multiple values.
This bubble chart illustrates how average time spent, average addiction level, and number of videos watched vary across platforms:
interactive_3d <- plot_ly(
data,
x = ~Total.Time.Spent,
y = ~Self.Control,
z = ~Addiction.Level,
color = ~Platform,
colors = viridis(length(unique(data$Platform))),
type = "scatter3d",
mode = "markers",
marker = list(
size = 5,
opacity = 0.8,
line = list(width = 0.5, color = 'black')
),
text = ~paste(
"<b>Platform:</b>", Platform,
"<br><b>Age:</b>", Age,
"<br><b>Gender:</b>", Gender_fixed,
"<br><b>Videos Watched:</b>", Number.of.Videos.Watched,
"<br><b>Time Spent:</b>", Total.Time.Spent,
"<br><b>Self Control:</b>", Self.Control,
"<br><b>Addiction:</b>", Addiction.Level
),
hoverinfo = "text"
) %>%
layout(
title = list(
text = "<b>3D Relationship: Time Spent, Self Control, and Addiction</b>",
x = 0.5
),
scene = list(
xaxis = list(title = "<b>Total Time Spent</b>"),
yaxis = list(title = "<b>Self Control</b>"),
zaxis = list(title = "<b>Addiction Level</b>")
),
legend = list(title = list(text = "<b>Platform</b>")),
margin = list(l = 0, r = 0, b = 0, t = 80)
)
interactive_3d
This 3D scatter plot reveals the multi-dimensional interaction between:
This heatmap visualizes how average Addiction Level varies across combinations of social media platforms and video content categories.
heatmap_data <- data %>%
group_by(Platform, Video.Category) %>%
summarise(Avg_Addiction = mean(Addiction.Level, na.rm = TRUE))
## `summarise()` has grouped output by 'Platform'. You can override using the
## `.groups` argument.
# Add formatted value column
heatmap_data$Label <- round(heatmap_data$Avg_Addiction, 1)
# Plot with value labels
heat_plot <- ggplot(heatmap_data, aes(x = Platform, y = Video.Category, fill = Avg_Addiction)) +
geom_tile() +
geom_text(aes(label = Label), color = "white", size = 3) + # value labels
scale_fill_viridis_c(option = "D") +
labs(title = "Average Addiction Level by Platform and Video Category",
x = "Platform",
y = "Video Category",
fill = "Avg Addiction") +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.5, face = "bold"),
axis.text.x = element_text(angle = 45, hjust = 1)
)
print(heat_plot)
This faceted scatter plot visualizes the relationship between Total Time Spent (minutes) on each platform and the corresponding Addiction Level reported by users.
time_addiction_plot <- ggplot(data, aes(x = Total.Time.Spent, y = Addiction.Level, color = Platform)) +
geom_point(alpha = 0.6) +
geom_smooth(method = "lm", se = FALSE) +
scale_color_viridis_d() +
facet_wrap(~Platform) +
labs(title = "Relationship Between Time Spent and Addiction Level",
x = "Total Time Spent (minutes)",
y = "Addiction Level") +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5, face = "bold"),
legend.position = "none")
print(time_addiction_plot)
## `geom_smooth()` using formula = 'y ~ x'
| Platform | Trend Direction | Visual Pattern | Interpretation |
|---|---|---|---|
| Flat (~0 slope) | Widely scattered, no trend | Time spent does not predict addiction | |
| Slightly upward | Moderate vertical spread | Small positive association | |
| TikTok | Flat | Dense at mid-levels | Weak relationship |
| YouTube | Slight downward | Broad spread across time | Time spent may slightly decrease addiction |
This radar chart visually compares normalized averages (0 to 1) of key behavioral and psychological metrics across four platforms: Facebook, Instagram, TikTok, and YouTube.
platform_summary <- data %>%
group_by(Platform) %>%
summarise(
Age = mean(Age),
Time_Spent = mean(Total.Time.Spent),
Self_Control = mean(Self.Control),
Addiction_Level = mean(Addiction.Level),
Videos_Watched = mean(Number.of.Videos.Watched)
)
# Reshape to long format
radar_data <- platform_summary %>%
pivot_longer(cols = -Platform, names_to = "Metric", values_to = "Value") %>%
group_by(Metric) %>%
mutate(Scaled = (Value - min(Value)) / (max(Value) - min(Value))) %>%
ungroup()
# radar chart
fig <- plot_ly(type = 'scatterpolar', mode = 'lines+markers')
platforms <- unique(radar_data$Platform)
for (p in platforms) {
fig <- fig %>%
add_trace(
r = radar_data %>% filter(Platform == p) %>% pull(Scaled),
theta = radar_data %>% filter(Platform == p) %>% pull(Metric),
name = p
)
}
fig <- fig %>%
layout(
polar = list(radialaxis = list(visible = TRUE, range = c(0, 1))),
title = "Platform Comparison Across Key Metrics"
)
fig
Each axis represents one metric, and values are scaled (normalized) so the maximum observed value for each metric equals 1.
| Platform | High in… | Low in… | Behavioral Summary |
|---|---|---|---|
| Time Spent | Videos Watched, Addiction | High session time, passive engagement | |
| Videos Watched, Addiction | Self-Control | Heavy scrollers with higher addiction risk | |
| TikTok | Self-Control, Videos Watched | Age | Younger audience with more control, high content volume |
| YouTube | Older Users, Balanced profile | Time Spent | Older demographic with moderate use and control |
# Descriptive stats
desc_stats <- describeBy(data[, c("Age", "Total.Time.Spent", "Video.Length",
"Number.of.Videos.Watched", "Self.Control",
"Addiction.Level")],
group = data$Platform,
mat = TRUE)
print(desc_stats)
## item group1 vars n mean sd median
## Age1 1 Facebook 1 187 40.385027 13.074698 41.0
## Age2 2 Instagram 1 233 40.223176 12.804295 41.0
## Age3 3 TikTok 1 226 40.995575 13.751605 42.5
## Age4 4 YouTube 1 220 39.859091 14.771723 39.0
## Total.Time.Spent1 5 Facebook 2 187 157.272727 82.279327 160.0
## Total.Time.Spent2 6 Instagram 2 233 142.652361 87.082126 131.0
## Total.Time.Spent3 7 TikTok 2 226 146.469027 85.082921 138.0
## Total.Time.Spent4 8 YouTube 2 220 145.431818 84.842373 145.0
## Video.Length1 9 Facebook 3 187 14.914439 8.171097 15.0
## Video.Length2 10 Instagram 3 233 15.489270 8.136291 16.0
## Video.Length3 11 TikTok 3 226 16.123894 16.166705 14.0
## Video.Length4 12 YouTube 3 220 70.390909 405.122691 15.0
## Number.of.Videos.Watched1 13 Facebook 4 187 24.080214 13.849381 24.0
## Number.of.Videos.Watched2 14 Instagram 4 233 26.240343 14.083272 26.0
## Number.of.Videos.Watched3 15 TikTok 4 226 25.615044 13.488680 26.0
## Number.of.Videos.Watched4 16 YouTube 4 220 25.572727 14.572106 26.5
## Self.Control1 17 Facebook 5 187 6.962567 2.040896 7.0
## Self.Control2 18 Instagram 5 233 6.952790 2.084922 7.0
## Self.Control3 19 TikTok 5 226 7.123894 2.106636 8.0
## Self.Control4 20 YouTube 5 220 7.077273 2.088987 7.0
## Addiction.Level1 21 Facebook 6 187 3.037433 2.040896 3.0
## Addiction.Level2 22 Instagram 6 233 3.081545 2.100339 3.0
## Addiction.Level3 23 TikTok 6 226 2.982301 2.256772 2.0
## Addiction.Level4 24 YouTube 6 220 3.031818 2.116231 3.0
## trimmed mad min max range skew
## Age1 40.264901 16.3086 18 64 46 0.025954702
## Age2 40.128342 16.3086 18 64 46 0.011087031
## Age3 40.989011 17.7912 18 64 46 -0.030872298
## Age4 39.534091 21.4977 18 64 46 0.114520815
## Total.Time.Spent1 158.331126 97.8516 10 298 288 -0.121073678
## Total.Time.Spent2 139.994652 109.7124 10 296 286 0.224466280
## Total.Time.Spent3 145.093407 111.1950 11 298 287 0.081361323
## Total.Time.Spent4 143.119318 110.4537 12 297 285 0.153202672
## Video.Length1 14.821192 10.3782 1 29 28 0.094254494
## Video.Length2 15.604278 10.3782 1 29 28 -0.128402949
## Video.Length3 14.395604 10.3782 1 120 119 4.625284959
## Video.Length4 15.630682 11.8608 1 3040 3039 7.158396681
## Number.of.Videos.Watched1 24.099338 17.7912 1 49 48 0.003003194
## Number.of.Videos.Watched2 26.347594 17.7912 1 49 48 -0.015468131
## Number.of.Videos.Watched3 25.626374 16.3086 1 49 48 0.023000716
## Number.of.Videos.Watched4 25.437500 18.5325 1 80 79 0.184420179
## Self.Control1 7.013245 2.9652 3 10 7 -0.115740124
## Self.Control2 6.989305 2.9652 3 10 7 -0.031322870
## Self.Control3 7.225275 2.9652 3 10 7 -0.247795718
## Self.Control4 7.187500 2.9652 3 10 7 -0.299495091
## Addiction.Level1 2.986755 2.9652 0 7 7 0.115740124
## Addiction.Level2 3.053476 2.9652 0 7 7 -0.004399726
## Addiction.Level3 2.840659 2.9652 0 9 9 0.414555499
## Addiction.Level4 2.926136 2.9652 0 8 8 0.344059131
## kurtosis se
## Age1 -1.1186707 0.9561166
## Age2 -1.1123414 0.8388373
## Age3 -1.1612212 0.9147432
## Age4 -1.3822243 0.9959094
## Total.Time.Spent1 -1.2016233 6.0168603
## Total.Time.Spent2 -1.1973012 5.7049398
## Total.Time.Spent3 -1.2583456 5.6596317
## Total.Time.Spent4 -1.2697523 5.7200716
## Video.Length1 -1.0282332 0.5975298
## Video.Length2 -1.1246044 0.5330262
## Video.Length3 27.2981470 1.0753932
## Video.Length4 49.4917838 27.3133662
## Number.of.Videos.Watched1 -1.1624879 1.0127671
## Number.of.Videos.Watched2 -1.2007111 0.9226258
## Number.of.Videos.Watched3 -1.0903210 0.8972536
## Number.of.Videos.Watched4 -0.4742684 0.9824512
## Self.Control1 -0.8926523 0.1492451
## Self.Control2 -1.1386202 0.1365878
## Self.Control3 -1.0172390 0.1401313
## Self.Control4 -0.8793780 0.1408394
## Addiction.Level1 -0.8926523 0.1492451
## Addiction.Level2 -1.1811552 0.1375978
## Addiction.Level3 -0.6261240 0.1501182
## Addiction.Level4 -0.7219355 0.1426763
desc_stats_df <- as.data.frame(desc_stats)
desc_stats_table <- desc_stats_df %>%
select(group1, vars, n, mean, sd, median, min, max, skew, kurtosis) %>%
rename(
Platform = group1,
Variable = vars,
Count = n,
Mean = mean,
SD = sd,
Median = median,
Min = min,
Max = max,
Skewness = skew,
Kurtosis = kurtosis
)
print("Formatted Descriptive Statistics Table:")
## [1] "Formatted Descriptive Statistics Table:"
print(desc_stats_table)
## Platform Variable Count Mean SD Median
## Age1 Facebook 1 187 40.385027 13.074698 41.0
## Age2 Instagram 1 233 40.223176 12.804295 41.0
## Age3 TikTok 1 226 40.995575 13.751605 42.5
## Age4 YouTube 1 220 39.859091 14.771723 39.0
## Total.Time.Spent1 Facebook 2 187 157.272727 82.279327 160.0
## Total.Time.Spent2 Instagram 2 233 142.652361 87.082126 131.0
## Total.Time.Spent3 TikTok 2 226 146.469027 85.082921 138.0
## Total.Time.Spent4 YouTube 2 220 145.431818 84.842373 145.0
## Video.Length1 Facebook 3 187 14.914439 8.171097 15.0
## Video.Length2 Instagram 3 233 15.489270 8.136291 16.0
## Video.Length3 TikTok 3 226 16.123894 16.166705 14.0
## Video.Length4 YouTube 3 220 70.390909 405.122691 15.0
## Number.of.Videos.Watched1 Facebook 4 187 24.080214 13.849381 24.0
## Number.of.Videos.Watched2 Instagram 4 233 26.240343 14.083272 26.0
## Number.of.Videos.Watched3 TikTok 4 226 25.615044 13.488680 26.0
## Number.of.Videos.Watched4 YouTube 4 220 25.572727 14.572106 26.5
## Self.Control1 Facebook 5 187 6.962567 2.040896 7.0
## Self.Control2 Instagram 5 233 6.952790 2.084922 7.0
## Self.Control3 TikTok 5 226 7.123894 2.106636 8.0
## Self.Control4 YouTube 5 220 7.077273 2.088987 7.0
## Addiction.Level1 Facebook 6 187 3.037433 2.040896 3.0
## Addiction.Level2 Instagram 6 233 3.081545 2.100339 3.0
## Addiction.Level3 TikTok 6 226 2.982301 2.256772 2.0
## Addiction.Level4 YouTube 6 220 3.031818 2.116231 3.0
## Min Max Skewness Kurtosis
## Age1 18 64 0.025954702 -1.1186707
## Age2 18 64 0.011087031 -1.1123414
## Age3 18 64 -0.030872298 -1.1612212
## Age4 18 64 0.114520815 -1.3822243
## Total.Time.Spent1 10 298 -0.121073678 -1.2016233
## Total.Time.Spent2 10 296 0.224466280 -1.1973012
## Total.Time.Spent3 11 298 0.081361323 -1.2583456
## Total.Time.Spent4 12 297 0.153202672 -1.2697523
## Video.Length1 1 29 0.094254494 -1.0282332
## Video.Length2 1 29 -0.128402949 -1.1246044
## Video.Length3 1 120 4.625284959 27.2981470
## Video.Length4 1 3040 7.158396681 49.4917838
## Number.of.Videos.Watched1 1 49 0.003003194 -1.1624879
## Number.of.Videos.Watched2 1 49 -0.015468131 -1.2007111
## Number.of.Videos.Watched3 1 49 0.023000716 -1.0903210
## Number.of.Videos.Watched4 1 80 0.184420179 -0.4742684
## Self.Control1 3 10 -0.115740124 -0.8926523
## Self.Control2 3 10 -0.031322870 -1.1386202
## Self.Control3 3 10 -0.247795718 -1.0172390
## Self.Control4 3 10 -0.299495091 -0.8793780
## Addiction.Level1 0 7 0.115740124 -0.8926523
## Addiction.Level2 0 7 -0.004399726 -1.1811552
## Addiction.Level3 0 9 0.414555499 -0.6261240
## Addiction.Level4 0 8 0.344059131 -0.7219355
print("Shapiro-Wilk Normality Tests:")
## [1] "Shapiro-Wilk Normality Tests:"
shapiro_results <- data %>%
group_by(Platform) %>%
summarise(
Addiction_Level_SW_p = shapiro.test(Addiction.Level)$p.value,
Time_Spent_SW_p = shapiro.test(Total.Time.Spent)$p.value,
Videos_Watched_SW_p = shapiro.test(Number.of.Videos.Watched)$p.value,
Self_Control_SW_p = shapiro.test(Self.Control)$p.value
)
print(shapiro_results)
## # A tibble: 4 × 5
## Platform Addiction_Level_SW_p Time_Spent_SW_p Videos_Watched_SW_p
## <fct> <dbl> <dbl> <dbl>
## 1 Facebook 7.34e- 9 0.00000808 0.0000223
## 2 Instagram 2.10e-10 0.0000000361 0.000000757
## 3 TikTok 9.93e-10 0.000000221 0.0000156
## 4 YouTube 2.03e- 8 0.0000000830 0.0000185
## # ℹ 1 more variable: Self_Control_SW_p <dbl>
Addiction.LevelTotal.Time.SpentNumber.of.Videos.WatchedSelf.Controlcolnames(data)
## [1] "Age" "Gender"
## [3] "Location" "Profession"
## [5] "Platform" "Total.Time.Spent"
## [7] "Video.Category" "Video.Length"
## [9] "Number.of.Videos.Watched" "Frequency"
## [11] "Watch.Reason" "DeviceType"
## [13] "Watch.Time" "Self.Control"
## [15] "Addiction.Level" "CurrentActivity"
## [17] "Hour" "Gender_fixed"
# Correlation analysis for numeric variables with significance testing
numeric_data <- data %>% select(Age, Total.Time.Spent,
Video.Length,
Number.of.Videos.Watched,
Self.Control,
Addiction.Level,
Total.Time.Spent)
correlation_matrix <- cor(numeric_data, use = "complete.obs")
print("Correlation Matrix:")
## [1] "Correlation Matrix:"
print(correlation_matrix)
## Age Total.Time.Spent Video.Length
## Age 1.00000000 0.04478595 -0.088644457
## Total.Time.Spent 0.04478595 1.00000000 -0.091464950
## Video.Length -0.08864446 -0.09146495 1.000000000
## Number.of.Videos.Watched -0.01337237 -0.01576289 0.109779786
## Self.Control -0.01420426 -0.01611920 -0.131498679
## Addiction.Level -0.02266599 -0.01029117 0.005144647
## Number.of.Videos.Watched Self.Control Addiction.Level
## Age -0.01337237 -0.01420426 -0.022665990
## Total.Time.Spent -0.01576289 -0.01611920 -0.010291166
## Video.Length 0.10977979 -0.13149868 0.005144647
## Number.of.Videos.Watched 1.00000000 -0.04012273 0.014965231
## Self.Control -0.04012273 1.00000000 -0.953799270
## Addiction.Level 0.01496523 -0.95379927 1.000000000
# Function to calculate p-values for correlations
cor.mtest <- function(mat, conf.level = 0.95) {
mat <- as.matrix(mat)
n <- ncol(mat)
p.mat <- matrix(NA, n, n)
diag(p.mat) <- 0
for (i in 1:(n - 1)) {
for (j in (i + 1):n) {
test <- cor.test(mat[, i], mat[, j], conf.level = conf.level)
p.mat[i, j] <- p.mat[j, i] <- test$p.value
}
}
colnames(p.mat) <- rownames(p.mat) <- colnames(mat)
return(p.mat)
}
# Compute p-values
p_values <- cor.mtest(numeric_data)
corrplot(
correlation_matrix,
method = "color",
type = "upper",
addCoef.col = "black", # Show correlation coefficients
tl.col = "black", # Label color
tl.cex = 0.8, # Label size
number.cex = 0.7, # Coefficient size
col = colorRampPalette(c("red", "white", "blue"))(200),
mar = c(0, 0, 1, 0),
# Show all correlations regardless of significance
p.mat = p_values,
sig.level = 0.05,
insig = "n", # ⬅️ This ensures NOTHING is hidden or crossed
title = "Correlation Matrix of Numeric Variables with Significance",
)
This section visualizes the pairwise relationships between key
numeric variables such as Age,
Total Time Spent, Video Length,
Number of Videos Watched, Self-Control,
Addiction Level, and Hour. The Pearson
correlation coefficient is used to measure linear relationships, and all
correlations are displayed regardless of significance.
cor()) was used
to compute pairwise relationships.cor.test() for each variable pair.p.mat) was
passed to corrplot() using
sig.level = 0.05.insig = "n"
to retain insignificant results.| Variable Pair | Correlation (r) |
|---|---|
| Self.Control vs Addiction.Level | -0.95 |
| Self.Control vs Hour | -0.57 |
| Addiction.Level vs Hour | 0.59 |
| Video.Length vs Number.of.Videos.Watched | 0.11 |
| Video.Length vs Self.Control | -0.13 |
| Video.Length vs Addiction.Level | -0.15 |
| All others | |
Note: While some correlations are moderate to strong (e.g., -0.95), many others are weak (< ±0.1), even if visually noticeable.
The strongest and most meaningful correlation is between Self-Control and Addiction Level (r = -0.95), indicating a strong negative relationship. Users with lower self-control scores report significantly higher addiction levels.
Hour of Use (time of day when users most engage) has:
Video Length shows:
Age, Total Time Spent, and Number of Videos Watched show minimal correlations, suggesting they are not strong individual predictors of addictive behavior.
scatter_regression <- ggplot(data, aes(x = Self.Control, y = Addiction.Level, color = Platform)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "lm", se = TRUE) +
facet_wrap(~ Platform, scales = "free") +
scale_color_viridis_d() +
labs(title = "Self Control vs Addiction Level by Platform",
subtitle = "With linear regression lines and 95% confidence intervals",
x = "Self Control",
y = "Addiction Level") +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.5, face = "bold"),
plot.subtitle = element_text(hjust = 0.5),
legend.position = "none"
)
print(scatter_regression)
## `geom_smooth()` using formula = 'y ~ x'
This section visualizes the relationship between Self-Control and Addiction Level across four major social media platforms: Facebook, Instagram, TikTok, and YouTube. Each subplot includes a linear regression line with 95% confidence intervals, helping to assess the strength and consistency of the inverse relationship.
Addiction Level from
Self-Control.This multi-platform analysis reinforces the central finding:
> Higher self-control is associated with lower reported
addiction levels on all platforms.
While some platforms (like YouTube) show slightly more variability, the inverse relationship is consistent, suggesting that self-regulation is a key predictor of social media addiction, regardless of the specific platform in use.
platform_reason_table <- table(data$Platform, data$Watch.Reason)
print("Contingency Table - Platform by Watch Reason:")
## [1] "Contingency Table - Platform by Watch Reason:"
print(platform_reason_table)
##
## Boredom Entertainment Habit Procrastination
## Facebook 45 44 71 27
## Instagram 65 53 81 34
## TikTok 60 72 72 22
## YouTube 60 66 65 29
This analysis examines whether the choice of social media platform is independent of the user’s reason for watching.
These differences suggest that user motivations vary meaningfully by platform: - TikTok may be more entertainment-driven - Instagram and Facebook usage skews toward habitual and boredom-induced - YouTube usage is more balanced, likely due to its mixed content styles
chi_test_platform_reason <- chisq.test(platform_reason_table)
print("Chi-square test for Platform and Watch Reason:")
## [1] "Chi-square test for Platform and Watch Reason:"
print(chi_test_platform_reason)
##
## Pearson's Chi-squared test
##
## data: platform_reason_table
## X-squared = 10.772, df = 9, p-value = 0.2917
This test evaluates whether the type of platform used is associated with the user’s reason for watching (e.g., boredom, entertainment, habit, procrastination).
p-value = 0.2917 > 0.05
→ The result is not statistically significant at the 5%
level.
This means there is no strong evidence to reject the null hypothesis that Platform and Watch Reason are independent.
There is no significant association between Platform and Watch Reason.
platform_category_table <- table(data$Platform, data$Video.Category)
print("Contingency Table - Platform by Video Category:")
## [1] "Contingency Table - Platform by Video Category:"
print(platform_category_table)
##
## ASMR Comedy Entertainment Gaming Jokes/Memes Life Hacks Pranks
## Facebook 18 4 19 20 33 25 24
## Instagram 19 10 30 28 36 36 24
## TikTok 11 6 25 29 48 35 23
## YouTube 22 7 28 24 35 37 26
##
## Trends Vlogs
## Facebook 18 26
## Instagram 24 26
## TikTok 19 30
## YouTube 24 17
chi_test_platform_category <- chisq.test(platform_category_table)
print("Chi-square test for Platform and Video Category:")
## [1] "Chi-square test for Platform and Video Category:"
print(chi_test_platform_category)
##
## Pearson's Chi-squared test
##
## data: platform_category_table
## X-squared = 17.156, df = 24, p-value = 0.842
This test examines whether there is a significant relationship between the social media platform used and the type of video content consumed (e.g., Gaming, Vlogs, Memes).
p-value = 0.842 > 0.05
→ The result is not statistically significant.
This indicates that there is no strong evidence to reject the null hypothesis that Platform and Video Category are independent.
There is no significant association between platform and video category.
gender_platform_table <- table(data$Gender_fixed, data$Platform)
print("Contingency Table - Gender by Platform:")
## [1] "Contingency Table - Gender by Platform:"
print(gender_platform_table)
##
## Facebook Instagram TikTok YouTube
## Female 72 92 87 75
## Male 115 141 139 145
chi_test_gender_platform <- chisq.test(gender_platform_table)
print("Chi-square test for Gender and Platform:")
## [1] "Chi-square test for Gender and Platform:"
print(chi_test_gender_platform)
##
## Pearson's Chi-squared test
##
## data: gender_platform_table
## X-squared = 1.6482, df = 3, p-value = 0.6485
This test assesses whether there is a statistically significant association between a user’s gender and their preferred social media platform.
p-value = 0.6485 > 0.05
→ The test is not statistically significant.
This means there is no sufficient evidence to suggest that platform preference is influenced by gender in this dataset.
levene_test_result <- leveneTest(Addiction.Level ~ Platform, data = data)
print("Levene's Test for Homogeneity of Variance:")
## [1] "Levene's Test for Homogeneity of Variance:"
print(levene_test_result)
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 3 1.1469 0.3292
## 862
Before performing an ANOVA, it’s essential to check whether the assumption of homogeneity of variance (equal variances across groups) is met. This is tested using Levene’s Test.
p-value = 0.3292 > 0.05
→ The result is not statistically significant.
This means we fail to reject the null hypothesis of equal variances. Therefore, the assumption of homogeneity of variance is satisfied.
Addiction.Level grouped by
Platform.two_way_anova <- aov(Addiction.Level ~ Platform * Gender_fixed, data = data)
print("Two-way ANOVA: Platform and Gender on Addiction Level")
## [1] "Two-way ANOVA: Platform and Gender on Addiction Level"
print(summary(two_way_anova))
## Df Sum Sq Mean Sq F value Pr(>F)
## Platform 3 1 0.378 0.083 0.969
## Gender_fixed 1 1 1.432 0.314 0.576
## Platform:Gender_fixed 3 6 2.129 0.466 0.706
## Residuals 858 3917 4.565
This two-way ANOVA evaluates whether Addiction Level
is significantly affected by: 1. Platform (Facebook,
Instagram, TikTok, YouTube) 2. Gender (via
Gender_fixed) 3. The interaction between
Platform and Gender
Platform effect: p =
0.969
→ No significant difference in mean Addiction Level across
platforms.
Gender effect: p = 0.576
→ No significant difference in Addiction Level between genders.
Interaction effect (Platform × Gender): p =
0.706
→ No significant combined effect of Platform and Gender on Addiction
Level.
Despite previous visual trends (e.g., higher addiction on Instagram, differences by gender), this statistical test indicates that:
levene_test_gender <- leveneTest(Addiction.Level ~ Gender_fixed, data = data)
print("Levene's Test for Homogeneity of Variance (Gender):")
## [1] "Levene's Test for Homogeneity of Variance (Gender):"
print(levene_test_gender)
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 1 1.8338 0.176
## 864
# Perform t-test with appropriate method based on variance equality
if (levene_test_gender$`Pr(>F)`[1] < 0.05) {
# Unequal variances
t_test_result <- t.test(Addiction.Level ~ Gender_fixed, data = data, var.equal = FALSE)
print("Welch's T-test for Addiction Level by Gender (unequal variances):")
} else {
# Equal variances
t_test_result <- t.test(Addiction.Level ~ Gender_fixed, data = data, var.equal = TRUE)
print("Student's T-test for Addiction Level by Gender (equal variances):")
}
## [1] "Student's T-test for Addiction Level by Gender (equal variances):"
This section investigates whether Addiction Level differs significantly between males and females, using a two-sample t-test.
print(t_test_result)
##
## Two Sample t-test
##
## data: Addiction.Level by Gender_fixed
## t = -0.55671, df = 864, p-value = 0.5779
## alternative hypothesis: true difference in means between group Female and group Male is not equal to 0
## 95 percent confidence interval:
## -0.3766144 0.2101749
## sample estimates:
## mean in group Female mean in group Male
## 2.981595 3.064815
# effect size for t-test (Cohen's d)
cohens_d <- cohens_d(Addiction.Level ~ Gender_fixed, data = data)
print("Effect Size (Cohen's d) for Gender Difference:")
## [1] "Effect Size (Cohen's d) for Gender Difference:"
print(cohens_d)
## Cohen's d | 95% CI
## -------------------------
## -0.04 | [-0.18, 0.10]
##
## - Estimated using pooled SD.
The difference in Addiction Level between females and males is not meaningful in practical terms.
gender_boxplot <- ggplot(data, aes(x = Gender_fixed, y = Addiction.Level, fill = Gender_fixed)) +
geom_boxplot() +
geom_jitter(width = 0.2, alpha = 0.2) +
labs(title = "Addiction Level by Gender",
subtitle = paste("p-value =", round(t_test_result$p.value, 3),
", Cohen's d =", round(cohens_d$Cohens_d, 2)),
y = "Addiction Level") +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5, face = "bold"),
plot.subtitle = element_text(hjust = 0.5))
print(gender_boxplot)
The plot shows: - Nearly identical median values for males and females. - Similar spread (IQR) and distribution. - No visible difference in central tendency or spread.
This regression model aims to predict users’ Addiction Level based on demographic, behavioral, and platform-related variables.
# Multiple linear regression model to predict Addiction Level
# Create a basic model first
basic_model <- lm(Addiction.Level ~ Age + Gender_fixed + Total.Time.Spent +
Number.of.Videos.Watched + Self.Control + Platform,
data = data)
print("Basic Linear Regression Model:")
## [1] "Basic Linear Regression Model:"
print(summary(basic_model))
##
## Call:
## lm(formula = Addiction.Level ~ Age + Gender_fixed + Total.Time.Spent +
## Number.of.Videos.Watched + Self.Control + Platform, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.1655 -0.1464 -0.0533 0.0391 5.7468
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.2756917 0.1245439 82.507 < 2e-16 ***
## Age -0.0055713 0.0015825 -3.521 0.000453 ***
## Gender_fixedMale -0.0127821 0.0444243 -0.288 0.773624
## Total.Time.Spent -0.0005965 0.0002539 -2.350 0.019015 *
## Number.of.Videos.Watched -0.0037650 0.0015397 -2.445 0.014676 *
## Self.Control -0.9796563 0.0103610 -94.553 < 2e-16 ***
## PlatformInstagram 0.0329182 0.0623052 0.528 0.597402
## PlatformTikTok 0.1056494 0.0626709 1.686 0.092202 .
## PlatformYouTube 0.1029473 0.0630771 1.632 0.103028
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6326 on 857 degrees of freedom
## Multiple R-squared: 0.9127, Adjusted R-squared: 0.9118
## F-statistic: 1119 on 8 and 857 DF, p-value: < 2.2e-16
The model explains approximately 91% of the variance in Addiction Level — a strong fit. The overall model is highly significant (p < 0.001), suggesting it effectively captures key predictors.
# Checking for multicollinearity
vif_values <- vif(basic_model)
print("Variance Inflation Factors (check for multicollinearity):")
## [1] "Variance Inflation Factors (check for multicollinearity):"
print(vif_values)
## GVIF Df GVIF^(1/(2*Df))
## Age 1.003327 1 1.001662
## Gender_fixed 1.002553 1 1.001276
## Total.Time.Spent 1.006334 1 1.003162
## Number.of.Videos.Watched 1.005058 1 1.002526
## Self.Control 1.003916 1 1.001956
## Platform 1.011048 3 1.001833
Before trusting a regression model’s coefficients, it’s essential to check for multicollinearity—when predictor variables are highly correlated with one another, which can distort estimates.
# Stepwise regression to find the best model
step_model <- step(basic_model, direction = "both")
## Start: AIC=-784.22
## Addiction.Level ~ Age + Gender_fixed + Total.Time.Spent + Number.of.Videos.Watched +
## Self.Control + Platform
##
## Df Sum of Sq RSS AIC
## - Gender_fixed 1 0.0 343.0 -786.14
## - Platform 3 1.7 344.6 -785.92
## <none> 342.9 -784.22
## - Total.Time.Spent 1 2.2 345.1 -780.66
## - Number.of.Videos.Watched 1 2.4 345.3 -780.20
## - Age 1 5.0 347.9 -773.79
## - Self.Control 1 3577.4 3920.4 1323.71
##
## Step: AIC=-786.14
## Addiction.Level ~ Age + Total.Time.Spent + Number.of.Videos.Watched +
## Self.Control + Platform
##
## Df Sum of Sq RSS AIC
## - Platform 3 1.7 344.7 -787.86
## <none> 343.0 -786.14
## + Gender_fixed 1 0.0 342.9 -784.22
## - Total.Time.Spent 1 2.2 345.2 -782.57
## - Number.of.Videos.Watched 1 2.4 345.4 -782.11
## - Age 1 5.0 347.9 -775.70
## - Self.Control 1 3578.8 3921.8 1322.03
##
## Step: AIC=-787.86
## Addiction.Level ~ Age + Total.Time.Spent + Number.of.Videos.Watched +
## Self.Control
##
## Df Sum of Sq RSS AIC
## <none> 344.7 -787.86
## + Platform 3 1.7 343.0 -786.14
## + Gender_fixed 1 0.0 344.6 -785.92
## - Number.of.Videos.Watched 1 2.3 347.0 -784.10
## - Total.Time.Spent 1 2.3 347.0 -783.98
## - Age 1 4.9 349.6 -777.57
## - Self.Control 1 3578.2 3922.8 1316.26
print("Stepwise Model Selection Results:")
## [1] "Stepwise Model Selection Results:"
print(summary(step_model))
##
## Call:
## lm(formula = Addiction.Level ~ Age + Total.Time.Spent + Number.of.Videos.Watched +
## Self.Control, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.1287 -0.1407 -0.0550 0.0294 5.7960
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.3244960 0.1145341 90.143 < 2e-16 ***
## Age -0.0055503 0.0015820 -3.508 0.000474 ***
## Total.Time.Spent -0.0006139 0.0002534 -2.422 0.015631 *
## Number.of.Videos.Watched -0.0036875 0.0015377 -2.398 0.016696 *
## Self.Control -0.9788672 0.0103535 -94.544 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6327 on 861 degrees of freedom
## Multiple R-squared: 0.9122, Adjusted R-squared: 0.9118
## F-statistic: 2237 on 4 and 861 DF, p-value: < 2.2e-16
Stepwise regression (with both forward and backward selection) was used to identify the most parsimonious model for predicting Addiction Level while maximizing explanatory power.
Addiction.Level ~ Age + Total.Time.Spent + Number.of.Videos.Watched + Self.Control
# Residual diagnostics for final model
par(mfrow = c(2, 2))
plot(step_model)
par(mfrow = c(1, 1))
To ensure that the assumptions of linear regression are satisfied, four standard diagnostic plots of the residuals from the final model are analyzed.
# Test for heteroscedasticity
bp_test <- bptest(step_model)
print("Breusch-Pagan Test for Heteroscedasticity:")
## [1] "Breusch-Pagan Test for Heteroscedasticity:"
print(bp_test)
##
## studentized Breusch-Pagan test
##
## data: step_model
## BP = 46.491, df = 4, p-value = 1.947e-09
This test evaluates whether the residuals from the regression model have constant variance — a key assumption in linear regression. Violating this (i.e., heteroscedasticity) can affect the validity of coefficient estimates and their standard errors.
p-value = 1.95 × 10⁻⁹ < 0.05
→ The result is highly significant.
This means we reject the null hypothesis of homoscedasticity (constant variance of residuals).
Therefore, the model exhibits heteroscedasticity — the variance of errors is not constant across all levels of fitted values.
# Calculate robust standard errors using HC1 estimator
robust_se <- vcovHC(step_model, type = "HC1")
# Recalculate coefficient tests with robust SEs
robust_coeftest <- coeftest(step_model, vcov. = robust_se)
print("Robust Coefficient Estimates (with HC1-adjusted standard errors):")
## [1] "Robust Coefficient Estimates (with HC1-adjusted standard errors):"
print(robust_coeftest)
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.32449603 0.16185244 63.7896 < 2.2e-16 ***
## Age -0.00555034 0.00187412 -2.9616 0.003145 **
## Total.Time.Spent -0.00061387 0.00028284 -2.1704 0.030252 *
## Number.of.Videos.Watched -0.00368752 0.00156193 -2.3609 0.018454 *
## Self.Control -0.97886723 0.01107760 -88.3646 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Adding predicted values to the dataset
data$Predicted <- predict(step_model)
# Plot: Actual vs. Predicted
ggplot(data, aes(x = Predicted, y = Addiction.Level)) +
geom_point(alpha = 0.4, color = "steelblue") +
geom_smooth(method = "lm", color = "red", se = FALSE, linetype = "dashed") +
geom_abline(intercept = 0, slope = 1, linetype = "solid", color = "black") +
labs(title = "Model Fit: Predicted vs. Actual Addiction Level",
subtitle = "Dashed = Linear Trend | Solid = Perfect Fit Line",
x = "Predicted Addiction Level",
y = "Actual Addiction Level") +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
This section presents a visual evaluation of the model’s predictive performance by plotting the predicted values against the actual observed Addiction Level values.
# Extract Hour of Day from Watch.Time
hourly_data <- data %>%
mutate(Hour = hour(hms(Watch.Time))) %>%
count(Hour, Platform)
interactive_hourly <- plot_ly(
data = hourly_data,
x = ~Hour,
y = ~n,
color = ~Platform,
type = 'scatter',
mode = 'lines+markers',
line = list(width = 2),
marker = list(size = 7, line = list(width = 1, color = 'black')),
text = ~paste("Platform:", Platform,
"<br>Hour:", Hour,
"<br>Users:", n),
hoverinfo = 'text'
) %>%
layout(
title = list(
text = "<b>Video Watching Patterns by Hour of Day</b>",
x = 0.5
),
xaxis = list(
title = "<b>Hour of Day</b>",
tickmode = "array",
tickvals = seq(0, 23, 2),
ticktext = paste0(seq(0, 23, 2), ":00"),
range = c(0, 23),
dtick = 2
),
yaxis = list(title = "<b>Number of Users</b>", rangemode = "tozero"),
legend = list(title = list(text = "<b>Platform</b>")),
hovermode = "closest"
)
interactive_hourly
This interactive time series chart illustrates how user activity on different platforms varies throughout the day.
data$Location <- as.character(data$Location)
# Fix known typos
data$Location[data$Location == "Barzil"] <- "Brazil"
data$Location[data$Location == "Sri Lankan"] <- "Sri Lanka"
# Standardize country names
data$Country_Standardized <- countrycode(data$Location, "country.name", "country.name")
# summary by location
location_summary <- data %>%
filter(!is.na(Location)) %>% # filter out NA values
group_by(Location) %>%
summarise(
Count = n(),
Avg_Addiction = mean(Addiction.Level, na.rm = TRUE),
Avg_Time_Spent = mean(Total.Time.Spent, na.rm = TRUE),
Most_Popular_Platform = names(which.max(table(Platform)))
)
print(location_summary)
## # A tibble: 11 × 5
## Location Count Avg_Addiction Avg_Time_Spent Most_Popular_Platform
## <chr> <int> <dbl> <dbl> <chr>
## 1 Brazil 70 3.49 146. Instagram
## 2 Germany 51 3.12 129. Instagram
## 3 India 188 2.98 152. TikTok
## 4 Indonesia 65 2.88 137. YouTube
## 5 Japan 65 2.68 157. TikTok
## 6 Mexico 57 2.96 160. Facebook
## 7 Pakistan 66 2.52 156. Instagram
## 8 Philippines 66 2.82 168. TikTok
## 9 Sri Lanka 30 5.57 60.4 YouTube
## 10 United States 140 3.06 150. YouTube
## 11 Vietnam 68 2.74 147. Instagram
location_mapping <- data.frame(
dataset_name = unique(data$Location),
map_name = unique(data$Location),
stringsAsFactors = FALSE
)
# complete list of countries
all_countries <- data.frame(
country = unique(countrycode::codelist$country.name.en),
stringsAsFactors = FALSE
)
# Summarize the data
location_summary <- data %>%
group_by(Location) %>%
summarise(Avg_Addiction = mean(Addiction.Level, na.rm = TRUE)) %>%
rename(country = Location)
# Merge with the full country list
world_map_data <- all_countries %>%
left_join(location_summary, by = "country") %>%
rename(value = Avg_Addiction)
# choropleth plot
fig <- plot_geo(world_map_data)
fig <- fig %>%
add_trace(
locations = ~country,
locationmode = 'country names',
z = ~value,
text = ~paste("Country:", country, "<br>Avg Addiction:", round(value, 2)),
type = 'choropleth',
colorscale = 'Plasma',
zmin = 0,
zmax = 10,
colorbar = list(title = "Addiction Level"),
showscale = TRUE
) %>%
layout(
title = list(
text = "Global View: Average Social Media Addiction Level by Country",
x = 0.5
),
geo = list(
showframe = FALSE,
showcoastlines = TRUE,
projection = list(type = 'equirectangular')
)
)
fig
## Warning: Ignoring 280 observations
This project provided a detailed analysis of social media usage behavior, focusing on time-wasting patterns and addiction levels. Through data visualization and statistical techniques, several meaningful insights emerged.